NSF PAR Search | NSF Public Access Repository


Addressing Global Biodiversity Challenges: Ensuring Long-Term Sustainability of Morphological Data Collection and Reuse through MorphoBank

https://doi.org/10.3897/biss.8.135124

Long-Fox, Brooke; Andruchow-Colombo, Ana; Jariwala, Shreya; O’Leary, Maureen; Berardini, Tanya (August 2024, Biodiversity Information Science and Standards)

Phenotypic, especially morphological, data are highly useful in systematics, taxonomy, and phylogenetics. Despite the increased use of genetic information, phenotypic data are necessary when researching the fossil record and remain useful for living taxa by providing independent evidence for testing molecular clades. MorphoBank is a FAIR (Findable, Accessible, Interoperable, and Reusable) database providing open biodiversity data in the form of morphological characters (O’Leary and Kaufman 2011, O’Leary and Kaufman 2012), a similar concept to GenBank for open access sequence data. MorphoBank enables scientists to share morphological character data associated with their peer-reviewed publications in the form of phylogenetic matrices as Tree analysis using New Technology (TNT) or NEXUS files. MorphoBank hosts 1,738 publicly accessible projects (each MorphoBank project is issued a unique identifier (ID) beginning with the letter P followed by a number) with 173,559 images and 1,138 matrices as of July 2024. These data can be downloaded by the public, researchers, and students in the scientific community, where the data can be used for educational purposes or reused in additional phylogenetic analyses. MorphoBank encourages scientists to add content in numerous ways throughout the research process, including while actively working on a morphological matrix or in conjunction with a paper to be published that has a morphological matrix. For example, some large projects, such as P773, represent collaborative research that contains a matrix with 4,541 characters and over 12,000 annotated images. Researchers looking to replicate or utilize the data from this study, a task that would normally be extremely time and labor intensive, are able to quickly and easily download and work with the data in their own analyses. MorphoBank has a team of part-time curators and interns who also add content post-publication.
Between 2018 and 2023, MorphoBank staff accounted for 25% of project creation and 41% of project publication. The MorphoBank community members created more projects but published fewer of them in the same time frame. The MorphoBank curation team strives to add the matrices to make the data FAIR. A majority of the data are associated with publications in journals that require a subscription; MorphoBank makes the matrix data available with its complete metadata without a financial access barrier. Data standards for morphological character matrices include scored taxa, full taxonomic names, and complete character names with character state descriptions. Since NEXUS files have varying standardization and syntax (Maddison et al. 1997, Vos et al. 2012), importing a matrix can lead to data errors, which MorphoBank does not accept due to its mission to provide complete and reproducible datasets. Hence, users often add incomplete data as file attachments. To help ensure full data are uploaded, MorphoBank has partnered with journals to ensure instructions to authors or emails to authors of accepted manuscripts make clear the need to upload data matrices to MorphoBank. MorphoBank has been cited over 1,500 times, with increasing citations each year (Fig. 1). We examined the use and impact of MorphoBank data on systematic and phylogenetic research and found that most data are used in phylogenetic analyses, describing new species, and examining diversification of taxonomic groups, which span a wide range of organisms from vertebrates such as dinosaurs, reptiles, and mammals (including studies of human evolution) to plants, invertebrates, and micro-organisms. MorphoBank has developed and implemented an internship program for undergraduate biology students focused on training in phylogenetic data, curation, research writing, and conference presenting.
Part of this internship program involves utilizing Artificial Intelligence (AI) to increase efficiency by automating the extraction of character name and state data from published articles and integrating them into NEXUS files. Three additional activities help raise awareness and increase community contributions to MorphoBank: (1) A partnership with the American Museum of Natural History (AMNH) was established in Summer 2024 to train volunteer curators. (2) MorphoBank workshops have been developed for in-person (i.e., 12th North American Paleontological Convention in Ann Arbor, Michigan) and virtual (i.e., 3rd Joint Congress on Evolutionary Biology supported by the Society of Systematic Biologists) conferences. (3) Virtual workshops will be offered quarterly to educate the scientific community on ways to add their own phylogenetic data to MorphoBank.
The long-term sustainability of MorphoBank depends on success in three areas: (1) Financial: MorphoBank is currently supported by membership fees from academic institutions and museums; institutional support from the non-profit organization Phoenix Bioinformatics; and grants from the United States National Science Foundation. Its future depends on continued growth in membership. (2) Technical: The over 20-year-old MorphoBank codebase is being completely overhauled to provide better performance, add longer term software stability, and enable easier addition of new features. (3) Scientific: The outreach efforts to increase community awareness and contributions aim to ensure the continued relevance and utility of the resource. Growth in data depth and breadth feeds into making MorphoBank indispensable for research in this scientific domain.
Full Text Available
Arabidopsis research in 2030: Translating the computable plant

https://doi.org/10.1111/tpj.70047

Brady, Siobhan; Auge, Gabriela; Ayalew, Mentewab; Balasubramanian, Sureshkumar; Hamann, Thorsten; Inze, Dirk; Saito, Kazuki; Brychkova, Galina; Berardini, Tanya Z; Friesner, Joanna; et al (March 2025, The Plant Journal)

SUMMARY Plants are essential for human survival. Over the past three decades, work with the reference plant Arabidopsis thaliana has significantly advanced plant biology research. One key event was the sequencing of its genome 25 years ago, which fostered many subsequent research technologies and datasets. Arabidopsis has been instrumental in elucidating plant‐specific aspects of biology, developing research tools, and translating findings to crop improvement. It serves as a model not only for understanding plant biology but also for biology in other fields, with discoveries in Arabidopsis having led to applications in human health, including insights into immunity, protein degradation, and circadian rhythms. Arabidopsis research has also fostered the development of tools useful for the wider biological research community, such as optogenetic systems and auxin‐based degrons. This 4th Multinational Arabidopsis Steering Committee Roadmap outlines future directions, with emphasis on computational approaches, research support, translation to crops, conference accessibility, coordinated research efforts, climate change mitigation, sustainable production, and fundamental research. Arabidopsis will remain a nexus for discovery, innovation, and application, driving advances in both plant and human biology to the year 2030, and beyond.
Free, publicly-accessible full text available March 1, 2026
Data sharing and ontology use among agricultural genetics, genomics, and breeding databases and resources of the Agbiodata Consortium

https://doi.org/10.1093/database/baad076

Clarke, Jennifer L; Cooper, Laurel D; Poelchau, Monica F; Berardini, Tanya Z; Elser, Justin; Farmer, Andrew D; Ficklin, Stephen; Kumari, Sunita; Laporte, Marie-Angélique; Nelson, Rex T; et al (January 2023, Database)

Abstract Over the last couple of decades, there has been rapid growth in the number and scope of agricultural genetics, genomics and breeding databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as ‘databases’ throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets, which requires data sharing, along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, respectively, conducted a Consortium-wide survey to assess the current status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data-sharing practices by AgBioData databases are in a fairly healthy state, but it is not clear whether this is true for all metadata and data types across all databases; and that ontology use has not substantially changed since a similar survey was conducted in 2017. Based on our evaluation of the survey results, we recommend (i) providing training for database personnel in specific data-sharing techniques, as well as in ontology use; (ii) further study on what metadata is shared, and how well it is shared among databases; (iii) promoting an understanding of data sharing and ontologies in the stakeholder community; (iv) improving data sharing and ontologies for specific phenotypic data types and formats; and (v) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards.
Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means. Database URL https://www.agbiodata.org/databases
Full Text Available
Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO)

https://doi.org/10.1371/journal.pcbi.1009463

Ramsey, Jolene; McIntosh, Brenley; Renfro, Daniel; Aleksander, Suzanne A.; LaBonte, Sandra; Ross, Curtis; Zweifel, Adrienne E.; Liles, Nathan; Farrar, Shabnam; Gill, Jason J.; et al (October 2021, PLOS Computational Biology)
Ouellette, Francis (Ed.)
Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.
Full Text Available
PhyloGenes: An online phylogenetics and functional genomics resource for plant gene function inference

https://doi.org/10.1002/pld3.293

Zhang, Peifen; Berardini, Tanya Z.; Ebert, Dustin; Li, Qian; Mi, Huaiyu; Muruganujan, Anushya; Prithvi, Trilok; Reiser, Leonore; Sawant, Swapnil; Thomas, Paul D.; et al (December 2020, Plant Direct)

Abstract We aim to enable the accurate and efficient transfer of knowledge about gene function gained from Arabidopsis thaliana and other model organisms to other plant species. This knowledge transfer is frequently challenging in plants due to duplications of individual genes and whole genomes in plant lineages. Such duplications result in complex evolutionary relationships between related genes, which may have similar sequences but highly divergent functions. In such cases, functional inference requires more than a simple sequence similarity calculation. We have developed an online resource, PhyloGenes (phylogenes.org), that displays precomputed phylogenetic trees for plant gene families along with experimentally validated function information for individual genes within the families. A total of 40 plant genomes and 10 non‐plant model organisms are represented in over 8,000 gene families. Evolutionary events such as speciation and duplication are clearly labeled on gene trees to distinguish orthologs from paralogs. Nearly 6,000 families have at least one member with an experimentally supported annotation to a Gene Ontology (GO) molecular function or biological process term. By displaying experimentally validated gene functions associated to individual genes within a tree, PhyloGenes enables functional inference for genes of uncharacterized function, based on their evolutionary relationships to experimentally studied genes, in a visually traceable manner. For the many families containing genes that have evolved to perform different functions, PhyloGenes facilitates the use of evolutionary history to determine the most likely function of genes that have not been experimentally characterized. Future work will enrich the resource by incorporating additional gene function datasets such as plant gene expression atlas data.
The Gene Ontology knowledgebase in 2023

https://doi.org/10.1093/genetics/iyad031

Aleksander, Suzi A; Balhoff, James; Carbon, Seth; Cherry, J Michael; Drabkin, Harold J; Ebert, Dustin; Feuermann, Marc; Gaudet, Pascale; Harris, Nomi L; Hill, David P; et al (March 2023, GENETICS)

Abstract The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO—a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations—evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs)—mechanistic models of molecular “pathways” (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
Full Text Available
The Gene Ontology resource: enriching a GOld mine

https://doi.org/10.1093/nar/gkaa1113

Carbon, Seth; Douglass, Eric; Good, Benjamin M; Unni, Deepak R; Harris, Nomi L; Mungall, Christopher J; Basu, Siddartha; Chisholm, Rex L; Dodson, Robert J; Hartline, Eric; et al (December 2020, Nucleic Acids Research)
Abstract The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings and to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.
Full Text Available
